Abstract

Topic modelling techniques such as
LDA have recently been applied to
speech transcripts and OCR output.
These corpora may contain noisy or
erroneous texts which may undermine
topic stability. Therefore, it is important
to know how well a topic modelling
algorithm will perform when applied to
noisy data. In this paper we show that
different types of textual noise have
diverse effects on the stability of different
topic models. From these observations,
we propose guidelines for text corpus
generation, with a focus on automatic
speech transcription. We also suggest
topic model selection methods for noisy
corpora.

1 Introduction 

Topic modelling techniques are widely applied
in text retrieval tasks. Such techniques have
previously been applied to news sources
(Newman et al., 2006), OCR (Tamura et al., 2013),
blogs (Yokomoto et al., 2012) etc., in which
the quality of the source text is high with low
error rates (missing, misspelled, or incorrect
terms or phrases). However, with improvements
in the accuracy and reductions in the cost of
automatic speech transcription and optical
character recognition (OCR) technologies, the
range of sources that topic modelling can be
applied to is growing. One artefact of such
new text sources is their inherent noise. In
speech-to-text transcription, humans in general
manage a word error rate (WER) of 2% to 4%
(Fiscus et al., 2007). When transcribing with
vocabulary sizes of 200, 5000 and 100000,
the word error rates are 3%, 7% and 45%
respectively. The best WER for broadcast news
transcription is 13% (Pallet, 2003), but this
degrades to a WER of 25.7% or worse in
conference transcription and gets worse still
in casual conversation (Fiscus et al., 2007).
These records show that the difficulty of
automatic speech recognition rises with
vocabulary size, speaker dependency and
level of crosstalk.

Noise aside, many of these newly available
sources contain rich and valuable information
that can be analysed through topic modelling.
For example, automatic speech transcription
applied to call centre audio recordings is able
to capture a level of detail that is otherwise
unavailable unless the call audio is manually
reviewed, which is infeasible for large call
volumes. In this case topic modelling can be
applied to transcribed text to extract the key
issues and emerging topics of discussion.

In this study we propose a method for
simulating various types of transcription errors.
We then test the robustness of a popular topic
modelling algorithm, Latent Dirichlet
Allocation (LDA), using a topic stability measure
introduced by Greene et al. (2014) over a variety
of corpora.

2 Topic Modelling and Metrics 

Blei et al. (2003) introduced Latent
Dirichlet Allocation (LDA) as a generative
probabilistic model for text corpora. LDA is
an unsupervised learning model that regulates
the probabilistic distributions between
documents, topics and words.

For the evaluation of topic models, we follow
the approach of Greene et al. (2014) for
measuring topic model agreement.

We denote a topic list as
$S = \{R_1, \ldots, R_k\}$, where $R_i$ is the
topic with rank $i$. An individual topic is
described as $R = \{T_1, \ldots, T_m\}$, where
$T_i$ is the term with rank $i$ belonging to
the topic. The Jaccard index (Jaccard, 1912)
compares the number of identical items in two
sets, but it neglects ranking order. Average
Jaccard (AJ) similarity is a top-weighted
version of the Jaccard index used to
accommodate ranking information. AJ calculates
the average of the Jaccard scores between
every pair of subsets in two lists. Based on AJ,
we can evaluate the agreement of two sets of
ranked lists (topic models). The topic model
agreement score between $S_1$ and $S_2$ is the
mean of the top similarity scores between each
cross pair of $R$. The agreement score is solved
using the Hungarian method (Kuhn, 1955)
and is constrained to the range [0,1], where
a perfect match between two identical $k$-way
ranked sets results in 1 and non-overlapping
sets score 0 (Greene et al., 2014).
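
The agreement measure described above can be
sketched in a few lines of Python. This is an
illustrative sketch, not the authors' code. It
assumes that the "subsets" compared by AJ are
the top-d prefixes of the two ranked lists for
d = 1..t, and it replaces the Hungarian method
with a brute-force search over permutations,
which finds the same optimal matching but is
only practical for a small number of topics.
All function names are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard index of two collections of terms (order ignored)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_jaccard(r1, r2):
    """Mean Jaccard score over the top-d prefixes of two ranked lists.

    Assumption: the compared subsets are the prefixes r[:d], d = 1..t.
    """
    t = min(len(r1), len(r2))
    if t == 0:
        return 0.0
    return sum(jaccard(r1[:d], r2[:d]) for d in range(1, t + 1)) / t


def agreement(s1, s2):
    """Mean AJ over the best one-to-one matching of topics in s1 and s2.

    Brute force over all permutations stands in for the Hungarian
    method here; it is exponential in k, so keep k small.
    """
    k = len(s1)
    if k == 0 or len(s2) != k:
        raise ValueError("both models need the same non-zero number of topics")
    sim = [[average_jaccard(a, b) for b in s2] for a in s1]
    best = max(
        sum(sim[i][p[i]] for i in range(k))
        for p in permutations(range(k))
    )
    return best / k


s1 = [["game", "team", "win"], ["market", "bank", "share"]]
s2 = [["market", "share", "bank"], ["game", "team", "win"]]
print(agreement(s1, s1))            # 1.0
print(round(agreement(s1, s2), 3))  # 0.889
```

The second call scores below 1 even though both
models contain the same terms, because two terms
of the "market" topic swap ranks. For larger k,
a polynomial-time assignment solver such as
SciPy's `linear_sum_assignment` would be the
more usual choice.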

3 Datasets 

In this paper, we explore two datasets, bbc and
wikilow (Greene et al., 2014), with different
document sizes and corpus sizes. The bbc corpus
includes general BBC news articles. This
corpus contains 2225 documents in 5 topics and
uses 3121 terms. The wikilow corpus is a
subset of Wikipedia in which articles are
labelled with fine-grained WikiProject
sub-groups. It contains 4986 documents in 10
topics and uses 15411 terms. In both datasets
the topics consist of distinct vocabularies which
we expect LDA to detect. For example, the
topics in the bbc dataset are business,
entertainment, politics, sport and technology.

3.1 Textual Noise 

We artificially introduce noise into text to
investigate the performance of topic modelling
over naturally noisy sources. We measure
noise using word error rate (WER), a common
metric for measuring speech recognition
accuracy. Moreover, WER has been used as
a salient metric in speech quality analytics
(Saon et al., 2006) and spoken dialogue
systems (Cavazza, 2001). In Equation 1, WER is
defined as the ratio of the sum of the number
of substitutions S, the number of deletions D
and the number of insertions I to the number
of terms in the reference N.
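
As an illustration of this definition, the
sketch below computes WER for a pair of
whitespace-tokenised strings. It assumes that
S + D + I is taken as the minimum number of
word-level edits (the usual Levenshtein
alignment); the function name is hypothetical
and this is not the tooling used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N using a minimum-edit word alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one term")
    # prev[j]: minimum edits turning the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate(
    "we are hoping to understand the creative industry",
    "we are hoping to understand the cardiff induced",
))  # 0.25 (two substitutions over eight reference terms)
print(word_error_rate("recognise speech", "wreck a nice beach"))  # 2.0
```

The second example shows that WER is not bounded
by 1: insertions can push the edit count above
the number of reference terms.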


Table 1: An example of Metaphone replacement in bbc corpus

original corpus    We are hoping to understand the creative industry...
replaced corpus    We are hoping to understand the Cardiff induced ...


Table 2: Double Metaphone dictionary where terms are ranked with
descending frequencies

Metaphone    matching terms
ANTS         industry, units, induced, wound, ...
KRTF         grateful, creative, Cardiff, ...


The experiments investigate the robustness
of topic models against each type of noise, and
the noise levels at which the output of a topic
model remains consistent with that of the
original corpus. Deletion noise is introduced by
randomly removing a portion of the text in the
corpus. The proportion of deletion ranges from
0% to 50% and terms are selected from a
uniform distribution. Insertion noise is
introduced by adding a portion (0% to 50%) of
frequent terms from a list of frequent English
words with 7726 entries^. The probability of
sampling a given term from the list is based
on its term frequency.
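
A minimal sketch of the two noise generators
follows. Several details are assumptions that
the text does not specify: the noise rate is
taken relative to the original token count, an
exact number of tokens is affected rather than
each token being flipped independently, and
inserted words are placed at uniformly random
positions. The small `freq_list` dictionary is
a made-up stand-in for the 7726-entry frequency
list.

```python
import random


def delete_noise(tokens, rate, rng):
    """Remove round(rate * len(tokens)) tokens chosen uniformly at random."""
    n_drop = round(rate * len(tokens))
    drop = set(rng.sample(range(len(tokens)), n_drop))
    return [t for i, t in enumerate(tokens) if i not in drop]


def insert_noise(tokens, rate, freq_list, rng):
    """Insert round(rate * len(tokens)) words sampled by frequency.

    Assumption: each inserted word lands at a uniformly random position.
    """
    words = list(freq_list)
    weights = [freq_list[w] for w in words]
    n_add = round(rate * len(tokens))
    noisy = list(tokens)
    for word in rng.choices(words, weights=weights, k=n_add):
        noisy.insert(rng.randrange(len(noisy) + 1), word)
    return noisy


# Hypothetical frequency counts, for illustration only.
freq_list = {"the": 60, "of": 30, "and": 25, "time": 2}
rng = random.Random(42)  # fixed seed so a noisy corpus can be regenerated
doc = "we are hoping to understand the creative industry here today".split()

print(delete_noise(doc, 0.3, rng))
print(insert_noise(doc, 0.3, freq_list, rng))
```

Passing an explicit `random.Random` instance
keeps each noisy copy of the corpus tied to its
seed, which matches the use of fixed seeds in
the experiments.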

3.2 Metaphone Replacement 

We simulate speech recognition errors using
Metaphone, a phonetic algorithm for indexing
English words by their pronunciation (Philips,
1990). Here we use the Double Metaphone
(Black, 2014) algorithm for replacement, and
the replacement is on a one-to-one basis. This
may not simulate the full range of errors
produced by ASR systems, in which the
substitution may be a one-to-many or many-to-one^
mapping, but it was deemed sufficient for the
current experiments.

In this study we map Metaphone codes to
frequent English words (examples in Table 2).
Then, in a given text document, we randomly
select X percent of the terms and replace each
with a term from the Metaphone map. The
candidate terms sharing the same Metaphone code
are selected based on term frequency: a frequent
term has a higher probability of being selected
than a rare term (see Table 1).
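
The replacement procedure can be sketched as
follows. To stay self-contained, the sketch
does not call a real Double Metaphone encoder:
`toy_encode` is a hypothetical lookup covering
only the words of Table 2, and the frequency
counts are invented so that they respect the
ranking shown there. Two further assumptions
are made: the original word is excluded from
its own candidates, and a selected term with no
other candidate is left unchanged.

```python
import random

# Toy stand-in for a Double Metaphone encoder (Table 2 examples only).
TOY_CODES = {
    "industry": "ANTS", "units": "ANTS", "induced": "ANTS", "wound": "ANTS",
    "grateful": "KRTF", "creative": "KRTF", "cardiff": "KRTF",
}


def toy_encode(word):
    """Return the phonetic code of a word, or None if it is unknown."""
    return TOY_CODES.get(word.lower())


def build_metaphone_map(freqs, encode):
    """Group frequent words by phonetic code: code -> {word: frequency}."""
    table = {}
    for word, freq in freqs.items():
        code = encode(word)
        if code is not None:
            table.setdefault(code, {})[word] = freq
    return table


def metaphone_replace(tokens, rate, table, encode, rng):
    """Replace a random share of tokens by same-code words, one-to-one."""
    n_swap = round(rate * len(tokens))
    noisy = list(tokens)
    for i in rng.sample(range(len(tokens)), n_swap):
        candidates = table.get(encode(tokens[i]), {})
        others = {w: f for w, f in candidates.items() if w != tokens[i].lower()}
        if others:  # frequency-weighted choice among the remaining candidates
            words = list(others)
            noisy[i] = rng.choices(words, weights=[others[w] for w in words])[0]
    return noisy


# Hypothetical frequencies, ordered as in Table 2.
freqs = {"industry": 40, "units": 30, "induced": 20, "wound": 10,
         "grateful": 30, "creative": 20, "cardiff": 10}
table = build_metaphone_map(freqs, toy_encode)
doc = "we are hoping to understand the creative industry".split()
print(metaphone_replace(doc, 0.25, table, toy_encode, random.Random(7)))
```

Because terms without a candidate stay as they
are, the effective error rate in this sketch can
fall below the requested rate; how the original
experiments handled such terms is not stated.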


$$ \mathrm{WER} = \frac{S + D + I}{N} \tag{1} $$


^ http://ucrel.lancs.ac.uk/bncfreq/flists.html
^ e.g. "recognise speech" transcribed as "wreck a nice beach".




4 Experiments 

In our experiments with LDA, we aim to test
topic stability over different levels of noise
and different numbers of topics. Each noise
generation method relies on a degree of
randomness in word deletion, insertion or
substitution, so to produce consistent and
repeatable results we generate multiple
copies of each modified corpus using 5 random
seeds. Similarly, we perform 5 runs of each
Mallet LDA (McCallum, 2002) topic model, as
the initial state of the algorithm is determined
by a random seed. LDA hyperparameters are left
at their default values, and each topic is
represented by its top 25 terms. The final
stability score at each noise level is the mean
over a number of runs with fixed seeds.
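
One possible reading of this protocol is
sketched below. The text does not say exactly
how corpus copies and model runs are paired, so
the pairing here (every corpus seed combined
with every model seed, and the same number of
topics for the reference and noisy models) is
an assumption. `add_noise`, `fit_topics` and
`agreement` are hypothetical callables supplied
by the caller, for example a noise generator, a
wrapper around an LDA toolkit and a topic
agreement function.

```python
from statistics import mean


def stability_at_level(corpus, level, add_noise, fit_topics, agreement,
                       k, corpus_seeds=range(5), model_seeds=range(5)):
    """Mean agreement between reference and noisy models at one noise level.

    add_noise(corpus, level, seed)  -> noisy copy of the corpus
    fit_topics(corpus, k, seed)     -> list of k ranked term lists
    agreement(model_a, model_b)     -> score in [0, 1]
    """
    scores = []
    for model_seed in model_seeds:
        # Reference model: fitted on the clean corpus with this seed.
        reference = fit_topics(corpus, k, model_seed)
        for corpus_seed in corpus_seeds:
            noisy_corpus = add_noise(corpus, level, corpus_seed)
            noisy_model = fit_topics(noisy_corpus, k, model_seed)
            scores.append(agreement(reference, noisy_model))
    return mean(scores)
```

With the default seeds this averages 25 scores
per noise level and number of topics.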

4.1 LDA output 

Figure 1 and Figure 2 show the topic stability
of the bbc and wikilow corpora, with reference
numbers of topics 5 and 10 respectively. For
each level of topic model complexity, a
downward slope indicates decreasing stability of
topic models against increasing noise.

In the bbc corpus, topic model stability differs
clearly between noise types. The model is
especially robust against Deletion errors.
When noise increases from 5% to 50%,
the Hungarian agreement score of the output
topics drops by only about 1% (for the fitted
model with K = 5 in Figure 1(a)). Checking each
model in Figure 1(a), we can say that in the bbc
corpus the topic models are robust against
random Deletion noise.

In Figure 1(b), the model with 5 topics
achieves the highest Hungarian agreement
score at noise levels of 5% and 10%, but it
drops significantly afterwards. With Insertion
errors above 15%, the best and most stable
topic model is the one with 15 topics, three
times the reference number. A similar trend is
observed with Metaphone replacement errors in
Figure 1(c): the topic model with the reference
number of topics achieves the highest stability
when the noise level is low. However, there are
differences between Insertion and Metaphone
errors in the bbc corpus tests. With 50%
Insertion errors, the model with 15 topics
achieves 56.4% agreement with the original
model, but the agreement is only 32.4% with
Metaphone errors. In the bbc corpus, Metaphone
errors are the most challenging case.

In the wikilow corpus (Figure 2) we observe
trends similar to those in Figure 1 for each
type of noise. With Deletion errors, the topic
model with the reference number of topics is the
most stable across noise levels. The difference
in topic agreement scores is below 2% across
noise levels. With Insertion and Metaphone
errors, the topic model with the reference
number of topics is almost the best when noise
is low, but it drops below the others when
noise is higher than 15%.

Although there are many similarities between
Figures 1 and 2, we would like to mention two
major differences across corpora. In Figures
2(b) and 2(c), the Hungarian scores of the
different topic models (numbers of topics)
descend with similar gradients. However, a
few models from the bbc corpus (K = 15, 20, 30)
keep roughly stable Hungarian scores in
Figure 1(b). Another difference is that the most
stable topic models at noise levels higher
than 20% in Figures 1(b) and 1(c) both have 15
topics, whereas the most stable models in
wikilow have 30 topics in the same settings.
However, if we compare them with the
corresponding reference topic numbers K, the
most stable topic models under high systematic
errors all have 3K topics. Models with more
than 3K topics are not optimal in Figures
1(b) and 1(c).

4.2 Discussion 

In Section 4.1 we observed topic model
stability in two corpora under three types of
noise. Here we define a single measurement of
topic stability across the different settings.
If the required level of agreement is set at
70%, LDA is robust against Deletion noise up to
50% in both the bbc and wikilow corpora.
However, the LDA model reaches this agreement
level only up to 10% Insertion noise and 5%
Metaphone replacement noise. Metaphone
replacement and insertion are therefore more
severe challenges to topic models than deletion.
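
The threshold-based measurement can be sketched
as a small helper that reports the highest noise
level up to which agreement stays at or above
the chosen level. The function name is
hypothetical, and the scores in the example are
invented for illustration; they are not
measurements from the figures.

```python
def max_tolerated_noise(scores_by_level, threshold=0.70):
    """Largest noise level up to which every score is >= threshold.

    scores_by_level maps a noise level (e.g. 0.05 for 5%) to an
    agreement score. Returns None if the lowest level already fails.
    """
    tolerated = None
    for level in sorted(scores_by_level):
        if scores_by_level[level] < threshold:
            break  # stop at the first level that falls below the threshold
        tolerated = level
    return tolerated


# Made-up scores, for illustration only.
example = {0.05: 0.90, 0.10: 0.75, 0.15: 0.60, 0.20: 0.72}
print(max_tolerated_noise(example))  # 0.1
```

Stopping at the first failing level means a
later recovery (0.72 at 20% in the example) does
not count, which is one reasonable reading of
"robust up to" a given noise level.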

Regarding deletion errors, we observe that
the robustness of a topic model is mostly
determined by the number of topics. When this
matches the reference, the topic model is the
most stable. However, this does not hold for
insertion and Metaphone errors. Reference
topic models with 5 (bbc) and 10 (wikilow)
topics achieve high stability only when
noise is < 10%. With higher levels of noise,
a more complex topic model exhibits higher
stability. Models with 15 (bbc) and 30
(wikilow) topics are the most robust at noise
level 50%.

Figure 1: LDA Hungarian scores against noise levels in bbc corpus (5 topics in reference). Panels: (a) Deletion errors, (b) Insertion errors, (c) Metaphone errors.

Figure 2: LDA Hungarian scores against noise levels in wikilow corpus (10 topics in reference). Panels: (a) Deletion errors, (b) Insertion errors, (c) Metaphone errors.

A tentative explanation of the high stability
of topic models against Deletion errors
concerns the LDA model itself. LDA takes term
frequency into account. The probability of a
word belonging to a topic is high if it appears
frequently in one topic and seldom in other
topics. Such a word is very likely to be an
entry in a topic model. If we randomly delete
corpus terms, the relative scale of frequent
terms is barely affected and these frequent
terms still have a high probability of
selection. Rare terms may be removed entirely by
deletion, but they have a low chance of
appearing in the original topic model anyway.
Therefore the LDA model has high stability over
various levels of deletion errors. Insertion and
Metaphone replacement introduce systematic
noise, which changes the frequency distribution
of the original texts and thus has more impact
on the LDA model. A high proportion of general
frequent terms may dilute the frequency
of characteristic terms and add noisy terms to
a topic model. However, a topic model with
many more topics than the reference can deal
with the effect of systematic errors.
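
The frequency argument above can be made
concrete with a short calculation. This is a
sketch under simplifying assumptions that are
not stated in the source: a characteristic term
occurs $c$ times among $N$ tokens, noise is
applied at rate $p$, counts are replaced by
their expectations, inserted words are never
the characteristic term itself, and every
selected occurrence is replaced by a different
term. The expected relative frequency of the
characteristic term is then:

```latex
\begin{aligned}
\text{deletion:}    \quad & \frac{(1-p)\,c}{(1-p)\,N} = \frac{c}{N} \\
\text{insertion:}   \quad & \frac{c}{N + pN} = \frac{1}{1+p}\cdot\frac{c}{N} \\
\text{replacement:} \quad & \frac{(1-p)\,c}{N} = (1-p)\cdot\frac{c}{N}
\end{aligned}
```

At $p = 0.5$, uniform deletion leaves the
expected relative frequency unchanged, insertion
scales it by $2/3$, and replacement halves it,
while in the last two cases the lost mass goes
to general frequent terms.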

5 Conclusions 

We investigated how transcription errors
affect the quality and robustness of topic models
produced over a range of corpora, using a
previously introduced topic stability measure.
We simulated transcription errors from the
perspective of word error rate and generated
noisy corpora with deletion, insertion and
Metaphone replacement. Topic models produced by
LDA show high tolerance to deletion noise up to
50% but low tolerance to insertion and
Metaphone replacement errors.

We find that the robustness of topic models is
mainly determined by the extent to which the
distribution of the original texts is modified.
Deletion noise is introduced randomly and its
effect on topic models is minor. Insertion and
Metaphone replacement noise is systematic and
undermines topic model stability to a large
extent.

Moreover, the number of topics selected also
affects topic agreement. With random noise or
low-level systematic errors (below 20%), a
correct or approximately correct number of
topics brings the highest topic agreement scores.
With high-level systematic errors, topic
models with 3 times the correct number of
topics are the most robust. In some corpora, a
redundant number of topics helps the LDA model
through severe systematic errors (Figure 1(b)).
This complements previous work by Greene et
al. (2014), who investigated how topic stability
is influenced by the number of topics over
noise-free corpora.

This suggests that transcribers should perhaps
consider omitting words when uncertainty is
high. A topic model is less influenced by a
randomly missing term than by an erroneous
replacement. For human consumption this may
not be optimal, but for output specifically
intended for topic extraction this approach
makes sense.